Introduction to data mining:
Data mining is an automated process of searching data for relationships and patterns. It is often done under the supervision of a human agent: the computer uses data mining algorithms to identify possible patterns in the data, and the human agent visually checks whether the patterns are relevant.
Applications of data mining include credit risk assessment in financial institutions, fraud detection at credit card companies and sales analysis in the retail industry; in general, any company that collects large amounts of historical data can benefit. Data mining is also very useful whenever the size of the data collected in a database makes manual analysis tedious.
Data mining can be used to improve revenue and reduce costs in a company. Consider a company that uses postal mail to market products to customers. This company can use data mining to select the customer attributes that best predict whether a customer will buy a product from a mailing campaign. Traditionally, the company may have targeted its mailings using customer income alone. The data mining process, however, may discover that marital status, age and income together are much better predictors of whether a customer will buy from a mailing campaign than income alone. Based on this information, the company could target its mailings to a more select group of customers, thereby reducing costs and increasing profits. The data mining process may also discover a cluster of highly affluent single parents in a particular location who respond to the mailings at an unusually high rate. This could alert the marketing manager to the existence of a highly profitable market segment for the company.
Note: data mining does not replace the need for a good understanding of the data and the business, or for sound judgment in distinguishing spurious from significant discoveries in your data.
The increasing availability of cheaper and more powerful computer processing means that the ability to analyze even terabytes of data is affordable to many companies. Although the data mining process for a very large database (VLDB) could take weeks on a slow computer, applying networked, multiprocessor-capable computers with gigabytes of memory to such databases provides satisfactory performance in many cases.
Although there are data mining tools that you can use without knowing the details of how they work, that knowledge is valuable at the very least when selecting between the various data mining tools that are available.
There are many data mining algorithms and techniques used by the machine learning community; one of them, known as One Rule (1Rule), is implemented below in Visual Basic.
Note: this article presents the pseudocode for a data mining algorithm and the Visual Basic implementation of that algorithm. The Visual Basic implementation may be used with any Relational Database Management System (RDBMS), including Microsoft® Access®, Microsoft® SQL Server, Oracle® and Sybase® databases.
Author’s Bio:
The author is an engineering graduate specializing in database and data mining programming and consulting. He can be reached by email at public@msaccessguru.com or at his website http://www.msaccessguru.com/.
Full Source Code:
THE FULL SOURCE CODE AND FILES FOR THIS ARTICLE ARE AVAILABLE FREE AT MSACCESSGURU.COM.
Legal Notice:
This article and the accompanying algorithm and source code are distributed without any warranty as to the fitness of the algorithm, article or source code. The use or distribution of this article, algorithm and source code is free, provided that they are used or distributed along with this article and without any changes to the article. No claim is made as to the accuracy or fitness of this algorithm or source code. The use of this algorithm, article or source code is at your own risk, and the author is not liable in any way for its use or for damages that may occur as a result of its use. Msaccessguru.com retains full copyright over the article and the accompanying source code.
1Rule (1R) algorithm for data mining:
1Rule creates one data mining rule for your data based on a single attribute (column in a database table). It chooses the rule that gives the lowest classification error after comparing the error rates of all the attributes. The rule assigns each distinct value of the one chosen attribute to one category or class. The algorithm can be defined in pseudocode as:
For each attribute in the data set
    For each distinct value of the attribute
        Find the most frequent classification
        Assign the classification to the value
        Calculate the error rate for the value
    Calculate the total error rate for the attribute
Choose the attribute with the lowest error rate
Create one rule for the chosen attribute
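The article's implementation is in Visual Basic, but the pseudocode above is compact enough to sketch directly. The following Python version is an illustration only (the function name `one_rule` and the dictionary-per-row data model are assumptions of this sketch, not the article's source):

```python
from collections import Counter, defaultdict

def one_rule(rows, attributes, class_attr):
    """1Rule: pick the single attribute whose one-column rule misclassifies
    the fewest instances.

    rows       - list of dicts mapping attribute name -> value
    attributes - candidate predictor attribute names
    class_attr - name of the class attribute

    Returns (best_attribute, rule, error_count), where rule maps each
    distinct value of best_attribute to its most frequent class.
    """
    best = None
    for attr in attributes:
        # For each distinct value of this attribute, count the classes.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_attr]] += 1
        # Assign each value its most frequent class; the rest are errors.
        rule, errors = {}, 0
        for value, class_counts in counts.items():
            majority_class, majority_count = class_counts.most_common(1)[0]
            rule[value] = majority_class
            errors += sum(class_counts.values()) - majority_count
        # Keep the attribute with the lowest total error count.
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best
```

Run against the lenses table of Appendix A, this selects Tear_Production_Rate with 7 errors out of 24 instances, which matches the worked example that follows.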
A Visual Basic Data Mining Implementation (The Lenses Database):
The data used is from a lenses database [2] for fitting contact lenses. This database was obtained from the UCI Machine Learning Repository [1]. The Lenses database has 5 attributes (columns in a database table), 24 instances (rows in the table) of data and 3 classes (hard contact lenses, soft contact lenses or no contact lenses). The goal of One Rule (1Rule) data mining in this implementation is to classify each distinct value of the attributes Patient_Age, Spectacle_Prescription, Astigmatic and Tear_Production_Rate of the Lenses database as No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses, and to choose the single attribute that does so with the lowest error.
The source files include 3 files named Index, lenses.data and lenses.names. lenses.data is a data file containing all possible numeric combinations of the attribute (column) values. The lenses data was imported from lenses.data into the dbo_lenses table in Microsoft® Access®. The column names and data values were obtained from the file lenses.names. The numeric data in the Microsoft® Access® table dbo_lenses was then converted to a lenses table containing descriptive labels for the data, using values from lenses.names. The dbo_lenses table or lenses table is used as the final data set needed by the Visual Basic implementation of the 1Rule data mining algorithm.
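The numeric-to-label conversion is a simple lookup. A Python sketch of the idea (the code-to-label maps below are read off Figures 3 and 4; the name `decode_row` is illustrative, not from the article's source):

```python
# Code-to-label maps derived from lenses.names; the pairings match the
# numeric dbo_lenses table (Figure 4) against the labeled lenses table (Figure 3).
LABELS = {
    "PATIENT_AGE": {1: "Young", 2: "Pre-presbyopic", 3: "Presbyopic"},
    "SPECTACLE_PRESCRIPTION": {1: "Myope", 2: "Hypermetrophe"},
    "ASTIGMATIC": {1: "No", 2: "Yes"},
    "TEAR_PRODUCTION_RATE": {1: "Reduced", 2: "Normal"},
    "CONTACT_LENSES": {1: "Hard_contact_lenses", 2: "Soft_contact_lenses",
                       3: "No_contact_lenses"},
}

def decode_row(numeric_row):
    """Convert one dbo_lenses row {column: code} to descriptive labels.

    Columns without a map (ROW_NUMBER, the primary key) pass through unchanged.
    """
    return {col: LABELS[col][val] if col in LABELS else val
            for col, val in numeric_row.items()}
```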
Data Mining Diagram For The Lenses Data:
Figure 1
The first step in implementing 1Rule is to note which attributes from the lenses table (see Appendix A: Figure 3) will be used to create the best one rule for the data set. Patient_Age, Spectacle_Prescription, Astigmatic and Tear_Production_Rate were chosen. Row_Number was not chosen because it is the Primary Key or Row Identifier for the lenses table. Contact_Lenses was not chosen because it contains the categories (No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses) used to classify the data. For example, a value of the Patient_Age attribute (column in the lenses table) may be classified as No_contact_lenses.
The second step is to list the distinct values of each attribute. For the lenses table (see Appendix A: Figure 3):
The Patient_Age attribute (column in the lenses table) has 3 distinct values: Young, Pre-presbyopic and Presbyopic.
The Spectacle_Prescription attribute has 2 distinct values: Myope and Hypermetrophe.
The Astigmatic attribute has 2 distinct values: No and Yes.
The Tear_Production_Rate attribute has 2 distinct values: Reduced and Normal.
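This step is mechanical; with rows modeled as dictionaries (an assumption of these sketches, not the article's VB recordsets), it is one line of Python:

```python
def distinct_values(rows, attribute):
    """Return the distinct values of one attribute, in order of first appearance."""
    # dict.fromkeys de-duplicates while preserving the order values were seen in.
    return list(dict.fromkeys(row[attribute] for row in rows))
```

Applied to the lenses table, `distinct_values(rows, "Patient_Age")` would yield Young, Pre-presbyopic and Presbyopic.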
The third step is to find the most frequent classification for each distinct value of an attribute, using the Contact_Lenses attribute values (No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses).
Patient_Age attribute (column in the lenses table):
Young has 8 instances (rows in the lenses table) classified as:
No_contact_lenses = 4 instances (rows)
Soft_contact_lenses = 2 instances (rows)
Hard_contact_lenses = 2 instances (rows)
The most frequent classification is No_contact_lenses (4 instances).
We then make a rule classifying the Young attribute value as No_contact_lenses (Young → No_Contact_Lenses).
The error count for the Young value is the number of times it appears in the data set (8 instances) minus the number of instances of its most frequent class (4 instances).
The Error Rate for the Young value is 4/8.
The next step is to repeat the calculations above for the Pre-presbyopic value.
Pre-presbyopic has 8 instances (rows in the lenses table) classified as:
No_contact_lenses = 5 instances (rows)
Soft_contact_lenses = 2 instances (rows)
Hard_contact_lenses = 1 instance (row)
The most frequent classification is No_contact_lenses (5 instances).
We then make a rule classifying the Pre-presbyopic attribute value as No_contact_lenses (Pre-presbyopic → No_Contact_Lenses).
The error count for the Pre-presbyopic value is the number of times it appears in the data set (8 instances) minus the number of instances of its most frequent class (5 instances).
The Error Rate for the Pre-presbyopic value is 3/8.
The final step is to repeat the calculations above for the Presbyopic value.
Presbyopic has 8 instances (rows in the lenses table) classified as:
No_contact_lenses = 6 instances (rows)
Soft_contact_lenses = 1 instance (row)
Hard_contact_lenses = 1 instance (row)
The most frequent classification is No_contact_lenses (6 instances).
We then make a rule classifying the Presbyopic attribute value as No_contact_lenses (Presbyopic → No_Contact_Lenses).
The error count for the Presbyopic value is the number of times it appears in the data set (8 instances) minus the number of instances of its most frequent class (6 instances).
The Error Rate for the Presbyopic value is 2/8.
The total number of errors for the Patient_Age attribute is 4 + 3 + 2 = 9, out of 24 instances in the data set. The Total Error Rate for the Patient_Age attribute is therefore 9/24.
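The tallies above can be checked mechanically. A small Python sketch, using the per-value class counts just worked out (the names here are illustrative; `Fraction` keeps the rates exact rather than as floats):

```python
from fractions import Fraction

# Class counts per Patient_Age value, as tallied in the text above.
patient_age_counts = {
    "Young":          {"No": 4, "Soft": 2, "Hard": 2},
    "Pre-presbyopic": {"No": 5, "Soft": 2, "Hard": 1},
    "Presbyopic":     {"No": 6, "Soft": 1, "Hard": 1},
}

def total_error_rate(value_counts):
    """Sum (instances minus majority count) per value, over total instances."""
    errors = sum(sum(c.values()) - max(c.values()) for c in value_counts.values())
    total = sum(sum(c.values()) for c in value_counts.values())
    return Fraction(errors, total)
```

`total_error_rate(patient_age_counts)` gives 9/24, which `Fraction` normalizes to 3/8.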
The fourth step is to repeat the above three steps for the remaining attributes Spectacle_Prescription, Astigmatic and Tear_Production_Rate to obtain their classifications and Total Error Rates.
The Total Error Rates for each attribute:
Patient_Age: 9/24
Spectacle_Prescription: 9/24
Astigmatic: 9/24
Tear_Production_Rate: 7/24
The fifth step is to choose the attribute with the lowest error rate. Tear_Production_Rate has the lowest error rate, 7/24.
The final step is to create One Rule (1Rule) based on this attribute. Classifying the attribute Tear_Production_Rate using the methodology of the previous exercise for the Patient_Age attribute (first step to third step) gives the classification rule: Reduced → No_contact_lenses and Normal → Soft_contact_lenses.
The One Rule (1Rule) chosen for the lenses table is:
Tear_Production_Rate:
Reduced → No_contact_lenses
Normal → Soft_contact_lenses
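Once chosen, the rule itself is just a lookup on one attribute. A minimal Python sketch of applying it (the names `ONE_RULE` and `classify` are this sketch's, not the article's):

```python
# The chosen 1Rule: one attribute, one class per distinct value.
ONE_RULE_ATTRIBUTE = "Tear_Production_Rate"
ONE_RULE = {"Reduced": "No_contact_lenses", "Normal": "Soft_contact_lenses"}

def classify(instance):
    """Classify an instance using only the single chosen attribute."""
    return ONE_RULE[instance[ONE_RULE_ATTRIBUTE]]
```

All other attributes of an instance are simply ignored; that is the whole point, and the whole limitation, of a one-attribute rule.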
This process could become very tedious for even a few thousand rows of data; it has been automated in Visual Basic as described in the rest of the article.
Visual Basic procedure for data mining based on One Rule (1R):
The Visual Basic project is implemented as 2 classes and 2 modules. The classes are cDataAccess and c1Rule, and the modules are modMain and modDataMining.
Descriptions of the classes and modules:
modDataMining:
This module implements the procedure Create_1Rule(), which creates One Rule (1R) for the data. Create_1Rule() initializes the SQL string used to access the Relational Database Management System (RDBMS) and the ADO connection string used to open a valid ADO connection to the RDBMS. It then executes the SQL string against the RDBMS to create an ADO recordset, using the ExecuteSQL() function of the cDataAccess class. Create_1Rule() then calls the ChooseBestRule() procedure of the c1Rule class, which creates a rule describing the attribute (column in a database table) that has the lowest error rate. The results of the call are then saved to XML with a call to the WriteToXML() function of the modMain module.
Visual Basic Data Mining with 1Rule:
Figure 2
cDataAccess:
This is the data access class. It uses the Microsoft® ActiveX Data Objects 2.1 library (ADO 2.1 or higher) from msadotlb21.tlb. The main purpose of the class is to create an ADO connection to any Relational Database Management System (RDBMS), such as Microsoft® Access® or Microsoft® SQL Server, for which there is an ADO provider. A call to the ExecuteSQL() function returns an ADO recordset. The function accepts the following parameters:
QueryString - a SQL SELECT string
Connect - an ADO connection string
Conn - an ADO connection
CursorLocation - the location of the cursor (client or server)
CursorType - the type of cursor to be used
LockType - the type of lock placed on records during editing
ReturnRecordset - the ADO recordset returned from the RDBMS
The ExecuteSQL function accepts, among other parameters, a SQL SELECT string to run on the database, for example “SELECT PATIENT_AGE, SPECTACLE_PRESCRIPTION, ASTIGMATIC, TEAR_PRODUCTION_RATE, CONTACT_LENSES FROM DBO_LENSES”, and returns an ADO recordset to the calling function.
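ADO and Visual Basic aside, the shape of ExecuteSQL() is easy to illustrate. The sketch below uses Python's built-in sqlite3 module as a stand-in for an ADO provider; the function name and the two parameters shown mirror the article's QueryString and Connect, but the cursor and lock parameters have no sqlite3 equivalent and are omitted:

```python
import sqlite3

def execute_sql(query_string, connect):
    """Rough analogue of the article's ExecuteSQL(): open a connection,
    run a SELECT, and return the rows as a list of dictionaries
    (standing in for an ADO recordset)."""
    conn = sqlite3.connect(connect)
    conn.row_factory = sqlite3.Row  # lets rows be addressed by column name
    try:
        return [dict(row) for row in conn.execute(query_string)]
    finally:
        conn.close()
```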
modMain:
This module contains the function WriteToXML(), which saves an ADO recordset to XML. The function WriteToXML(rst As ADODB.Recordset, strPath As String) As Boolean requires two parameters:
rst - the ADO recordset to save
strPath - a valid file path to save the ADO recordset to
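ADO recordsets carry their own XML persistence format; the sketch below only illustrates the idea of WriteToXML() in Python with the standard library's xml.etree module, using a list of row dictionaries in place of a recordset and an element layout of this sketch's own choosing:

```python
import xml.etree.ElementTree as ET

def write_to_xml(rows, str_path):
    """Rough analogue of the article's WriteToXML(): save a list of row
    dictionaries to an XML file, one <record> element per row."""
    root = ET.Element("recordset")
    for row in rows:
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            ET.SubElement(record, column).text = str(value)
    ET.ElementTree(root).write(str_path, encoding="utf-8", xml_declaration=True)
    return True  # mirrors the VB function's Boolean return
```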
c1Rule:
This class implements the One Rule (1Rule) data mining algorithm and procedures. The procedures and functions it implements include:
AddFilter() - Adds a value to a data structure (recordset) if the value does not already exist in it. The values added will be used as a filter.
AssignClass() - Calls CreateValueClass().
ChooseBestRule() - Creates a rule describing the attribute (column in a database table) with the lowest error rate.
CreateAddFilter() - Creates a data structure (attribute and value data) that stores unique values used in filtering other data structures.
CreateClassRules() - Creates a data structure which stores an attribute and the class it has been assigned to.
CreateDataSource() - Creates data structures for storing an attribute, its value, the class it is assigned to, its frequency and its total.
CreateValueClass() - Checks whether a value and class from the data set already exist in a data structure; if they do, it increments their frequency by 1, and if not, it adds the value and class with a frequency of 1.
FindErrorRate() - Calculates the error rate for each attribute.
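The frequency-counting logic that CreateValueClass() is described as performing is the heart of the algorithm's bookkeeping. A Python sketch of that described behavior (the dictionary keyed by (value, class) pairs is this sketch's data structure, not the VB class's recordset):

```python
def create_value_class(structure, value, cls):
    """Mirror of the CreateValueClass() behavior described above: bump the
    frequency of an existing (value, class) pair, or add it with frequency 1."""
    structure[(value, cls)] = structure.get((value, cls), 0) + 1
    return structure
```

Feeding every row of the data set through this counter yields exactly the per-value class tallies used in the worked Patient_Age example.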
Glossary of Data Mining Terms:
Attribute: A feature of your data set used to measure and describe it. An example is a column of a database table.
Classification: A machine learning summary or description of a data set. The class of an instance of data is the category it belongs to.
Data Mining: The process of discovering hidden and useful patterns and relationships in your data.
Instance: A specific, individual record in a data set. An example is a row of data in a database table.
Acknowledgements:
APPENDIX A:
The Lenses data used for data mining. Each instance is classified as No_contact_lenses, Soft_contact_lenses or Hard_contact_lenses. The attributes used for classification are Patient_Age, Spectacle_Prescription, Astigmatic and Tear_Production_Rate.
Figure 3 - lenses table:
ROW_NUMBER | PATIENT_AGE | SPECTACLE_PRESCRIPTION | ASTIGMATIC | TEAR_PRODUCTION_RATE | CONTACT_LENSES
1 | Young | Myope | No | Reduced | No_contact_lenses
2 | Young | Myope | No | Normal | Soft_contact_lenses
3 | Young | Myope | Yes | Reduced | No_contact_lenses
4 | Young | Myope | Yes | Normal | Hard_contact_lenses
5 | Young | Hypermetrophe | No | Reduced | No_contact_lenses
6 | Young | Hypermetrophe | No | Normal | Soft_contact_lenses
7 | Young | Hypermetrophe | Yes | Reduced | No_contact_lenses
8 | Young | Hypermetrophe | Yes | Normal | Hard_contact_lenses
9 | Pre-presbyopic | Myope | No | Reduced | No_contact_lenses
10 | Pre-presbyopic | Myope | No | Normal | Soft_contact_lenses
11 | Pre-presbyopic | Myope | Yes | Reduced | No_contact_lenses
12 | Pre-presbyopic | Myope | Yes | Normal | Hard_contact_lenses
13 | Pre-presbyopic | Hypermetrophe | No | Reduced | No_contact_lenses
14 | Pre-presbyopic | Hypermetrophe | No | Normal | Soft_contact_lenses
15 | Pre-presbyopic | Hypermetrophe | Yes | Reduced | No_contact_lenses
16 | Pre-presbyopic | Hypermetrophe | Yes | Normal | No_contact_lenses
17 | Presbyopic | Myope | No | Reduced | No_contact_lenses
18 | Presbyopic | Myope | No | Normal | No_contact_lenses
19 | Presbyopic | Myope | Yes | Reduced | No_contact_lenses
20 | Presbyopic | Myope | Yes | Normal | Hard_contact_lenses
21 | Presbyopic | Hypermetrophe | No | Reduced | No_contact_lenses
22 | Presbyopic | Hypermetrophe | No | Normal | Soft_contact_lenses
23 | Presbyopic | Hypermetrophe | Yes | Reduced | No_contact_lenses
24 | Presbyopic | Hypermetrophe | Yes | Normal | No_contact_lenses
APPENDIX A contd:
Figure 4 - dbo_lenses table:
ROW_NUMBER | PATIENT_AGE | SPECTACLE_PRESCRIPTION | ASTIGMATIC | TEAR_PRODUCTION_RATE | CONTACT_LENSES
1 | 1 | 1 | 1 | 1 | 3
2 | 1 | 1 | 1 | 2 | 2
3 | 1 | 1 | 2 | 1 | 3
4 | 1 | 1 | 2 | 2 | 1
5 | 1 | 2 | 1 | 1 | 3
6 | 1 | 2 | 1 | 2 | 2
7 | 1 | 2 | 2 | 1 | 3
8 | 1 | 2 | 2 | 2 | 1
9 | 2 | 1 | 1 | 1 | 3
10 | 2 | 1 | 1 | 2 | 2
11 | 2 | 1 | 2 | 1 | 3
12 | 2 | 1 | 2 | 2 | 1
13 | 2 | 2 | 1 | 1 | 3
14 | 2 | 2 | 1 | 2 | 2
15 | 2 | 2 | 2 | 1 | 3
16 | 2 | 2 | 2 | 2 | 3
17 | 3 | 1 | 1 | 1 | 3
18 | 3 | 1 | 1 | 2 | 3
19 | 3 | 1 | 2 | 1 | 3
20 | 3 | 1 | 2 | 2 | 1
21 | 3 | 2 | 1 | 1 | 3
22 | 3 | 2 | 1 | 2 | 2
23 | 3 | 2 | 2 | 1 | 3
24 | 3 | 2 | 2 | 2 | 3
APPENDIX B: One Rule (1Rule) Algorithm Flowchart
APPENDIX C:
The following files accompany this article:
1. A Visual Basic project
2. A Microsoft Access 2000 database
3. An HTML version of this article
4. A Microsoft Word 2000 version of this article
5. 3 files named Index, lenses.data and lenses.names
PLEASE CONTACT MSACCESSGURU.COM IF THE ACCOMPANYING FILES ARE MISSING.
Copyright ©2000-2002 by msaccessguru.com - All Rights Reserved
Email: public@msaccessguru.com